Using Tabular Data in Python

csv module

Python has a csv reader/writer as part of its built in library. It is called csv. This is the simplest way to read tabular data (data in table format). The type of data you used to use excel to process (hopefully you will try out python now). It must be in text format to use the csv module, so .csv (comma separated) or .tsv (tab separated) Here is the documentation: https://docs.python.org/2/library/csv.html

To use it, first you must import it

import csv



In [ ]:

    
import csv

Next we create a csv reader. You give it a file handle and optionally the dialect, the separator (usually commas or tabs), and the quote character.

with open(filename, 'r') as fh:
    reader = csv.reader(fh, delimiter='\t', quotechar='"')



In [ ]:

    
with open('walks.csv', 'r') as fh:
    reader = csv.reader(fh, delimiter=',')

The reader doesn't do anything yet. It is a generator that allows you to loop through the data (it is very similar to a file handle).

To loop through the data you just write a simple for loop

for row in reader:
    #process row

The each row will be a list with each element corresponding to a single column.



In [ ]:

    
with open('walks.csv', 'r') as fh:
    reader = csv.reader(fh, delimiter=',')
    for row in reader:
        print(row)

TRY IT

Open up the file workout.txt (tab delimited, tab='\t') with the csv reader and print out each row.



In [ ]:

Doesn't that look nice?

Well there are a few problems that I can see. First the header, how do we deal with that?

Headers

The easiest way I have found is to use the next method (that is available with any generator) before the for loop and to store that in a header variable. That reads the first line and stores it (so that you can use it later) and then advances the pointer to the next line so when you run the for loop it is only on the data.

header = reader.next()
for row in reader:
    #process data



In [ ]:

    
with open('walks.csv', 'r') as fh:
    reader = csv.reader(fh, delimiter=',')
    header = next(reader)
    for row in reader:
        print(row)  
    print("Header", header)

Values are Strings

Notice that each item is a string. You'll need to remember that and convert things that actually should be numbers using the float() or int() functions.



In [ ]:

    
with open('walks.csv', 'r') as fh:
    reader = csv.reader(fh, delimiter=',')
    header = next(reader)
    for row in reader:
        float_row = [float(row[0]), float(row[1])]
        print(float_row)

TRY IT

Open workouts with a csv reader. Save the header line to a variable called header. Convert each value in the data rows to ints and print them out.



In [ ]:

Analyzing our data

You can use just about everything we have learned up until this point to analyze your data: if statements, regexs, math, data structures. Let's look at some examples.



In [ ]:

    
# Let's find the average distance for all walks. 

with open('walks.csv', 'r') as fh:
    reader = csv.reader(fh, delimiter=',')
    header = next(reader)
    # Empty list for storing all distances
    walks = []
    for row in reader:
        #distance is in the first column
        dist = row[0]
        # Convert to float so we can do math
        dist = float(dist)
        # Append to our list
        walks.append(dist)
    
    # Use list aggregation methods to get average distance
    ave_dist = sum(walks) / len(walks)
    print("Average distance walked: {0:.1f}".format(ave_dist))



In [ ]:

    
# Let's see our pace for each walk
with open('walks.csv', 'r') as fh:
    reader = csv.reader(fh, delimiter=',')
    header = next(reader)
    for row in reader:
        #distance is in the first column
        dist = row[0]
        # Convert to float so we can do math
        dist = float(dist)
        #time in minutes is in the second column
        time_minutes = row[1]
        # Convert to float so we can do math
        time_minutes = float(time_minutes)
        # calculate pace as minutes / kilometer
        pace = time_minutes /dist
        print("Pace: {0:.1f} min/km".format(pace))
        
# If you want a challenge, try to make this seconds/mile



In [ ]:

    
# We can filter data. Let's get the ave pace only for walks longer than
# 3 km

# Let's see our pace for each walk
with open('walks.csv', 'r') as fh:
    reader = csv.reader(fh, delimiter=',')
    header = next(reader)
    paces = []
    for row in reader:
        #distance is in the first column
        dist = row[0]
        # Convert to float so we can do math
        dist = float(dist)
        # Don't count short walks
        if dist >= 3.0:
            #time in minutes is in the second column
            time_minutes = row[1]
            # Convert to float so we can do math
            time_minutes = float(time_minutes)
            pace = time_minutes /dist
            paces.append(pace)
            
ave_pace = sum(paces) / len(paces)
print("Average walking pace: {0:.1f} min/km".format(ave_pace))

Here is something I do all the time. It is a little more complicated than the above examples, so take your time trying to understand it. What I like to do is to read the csv data and transform it to a dictionary of lists. This allows me to use it in many different ways later in the code. It is most useful with larger dataset that I will be analyzing and using many different times. (You can even print it out as JSON!)



In [ ]:

    
# Lets see our pace for each walk
with open('walks.csv', 'r') as fh:
    reader = csv.reader(fh, delimiter=',')
    header = next(reader)
    # This is the dictionary we will put our data from the csv into
    # The key's are the column headers and the values is a list of
    # all the data in that column (transformed into floats)
    data = {}
    # Initialize our dictionary with keys from header and values
    # as empty lists
    for column in header:
        data[column] = []
    for row in reader:
        # Enumerate give us the index and the value so 
        # we don't have to use a count variable
        for index, column in enumerate(header):
            # convert data point to float
            data_point = float(row[index])
            # append data to dictionary's list for that column
            data[column].append(data_point)
    # look at that beautiful data. You can do anything with that!
    print(data)

TRY IT

Find the average number of squats done from the workouts.txt file. Feel free to copy the code for opening from the previous TRY IT.



In [ ]:

Writing CSVs

The csv module also contains code for writing csvs.

To write, you create a writer using the writer method and give it a filehandle and optionally delimiter and quotechar.

with open('my_file.csv', 'w') as fh:
    writer = csv.writer(fh, delimiter=',', quotechar='"')

Then use the writerow method with a list to write as it's argument.

writer.writerow([item1, item2])



In [ ]:

    
import random
with open('sleep.csv', 'w') as fh:
    writer = csv.writer(fh, delimiter='\t', quotechar='"')
    header = ['day', 'sleep (hr)']
    writer.writerow(header)
    for i in range(1,11):
        hr_sleep = random.randint(4,10)
        writer.writerow([i, hr_sleep])
        
#open the file to prove you wrote it. (Open in excel for best results)

TRY IT

Write the following data to a file called age_walking_pace.csv:

age | pace
----------
5   | 12.5
15  | 9.4
18  | 7.8
25  | 8.1
48  | 9.2
91  | 105.1

Separator is comma, quote char is "



In [ ]:

Project

The file fiveK.csv contains data from the top 100 female finishers for the Firecracer 5k held in Reston on the 4th of July 2016 http://www.prraces.com/firecracker/

Look at the data.
Then use your csv reading skills to read in the data using the csv reader.
Parse out the header and store in a variable called header.
Choose one of the following to calculate
- Easy: Calculate the pace for each runner and print it out.
- Medium Easy: Calculate the average time of all runners
- Medium: Calculate the average pace for those under 40 and those over 40
- Medium Hard: Calculate the average time for those from VA vs. those who traveled to attend the race
- Hard: Did people who registered early have faster times than those who registered late? Calculate the average time for the registration number in 500 increment chunks. i.e. Ave pace for registration number 1-500, 501-1000, etc.
- Hardest: Do people who put their city in all caps do better or worse than those who put their city in mixed case or lower case. (HINT: use regexs)
If you picked medium or above. Print out the category and the result into a csv called results.csv



In [ ]: